Use various classifiers (e.g. \(\ell_1\)-penalized and ridge logistic regression, SVM, and xgboost) for a hotspot detection model.
**Geographical levels** Two levels: state and county.
**Data** Between 2020-05-01 and 2020-08-25, take the JHU confirmed case incidence rate (7-day averaged; `confirmed_7dav_incidence_prop`) and the Facebook survey community CLI signal (`smoothed_hh_cmnty_cli`).
Then, form a covariate matrix by time-lagging these by 0 through 28 days. Additionally, add features by calculating slopes of the JHU incidence rate from the past \(x\) days, where \(x \in \{3, 6, 9, 12, 15, \ldots\}\).
The resulting covariate matrix is roughly 3162 by 193 at the state level, and 52093 by 193 at the county level. We have 953 different counties in our sample.
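The lag-and-slope feature construction described above can be sketched in base R as follows. This is an illustrative re-implementation on a toy series, not the actual pipeline (which lives in `helpers.r`); the names `make_lagged` and `make_slopes` are hypothetical.

```r
## Build lagged features and slope features for one signal of one geography.
make_lagged <- function(x, lags) {
  ## One column per lag 0..lags: the value of x shifted back by that many days.
  sapply(0:lags, function(l) c(rep(NA, l), head(x, length(x) - l)))
}
make_slopes <- function(x, windows = seq(3, 28, by = 3)) {
  ## Least-squares slope over the past w days, at each time point.
  sapply(windows, function(w) {
    sapply(seq_along(x), function(t) {
      if (t < w) return(NA)
      y <- x[(t - w + 1):t]
      coef(lm(y ~ seq_len(w)))[2]
    })
  })
}
x <- cumsum(rnorm(60))                      # a toy incidence series
X <- cbind(make_lagged(x, 28), make_slopes(x))
dim(X)                                      # 60 rows; 29 lag columns + 9 slope columns
```

With two signals (JHU and Facebook), 29 lags each plus the slope features per signal, the column count lands near the 193 quoted above.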
**Response data** A binary hotspot indicator (0 or 1) per location and date; the precise definition is given below, with the variables and functions.
**Training and test data** The data are split into a training and a test set at a ratio of 70%/30%, by geographical unit. The training set is used for training the hotspot model (e.g. for cross-validation of glmnet() or training svm()), and the test set is used only for creating the ROC or adjusted ROC curves.
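A minimal sketch of that geographic split (the report's actual helper is `sample_split_geo()` in `helpers.r`; `split_by_geo` below is a hypothetical stand-in): hold out 30% of the locations, so that no geography appears in both training and test data.

```r
## Split rows into train/test by geography, not by row.
split_by_geo <- function(df, pct_test = 0.3, seed = 100) {
  set.seed(seed)
  geos <- unique(df$geo_value)
  test_geos <- sample(geos, size = round(pct_test * length(geos)))
  list(df_train = df[!df$geo_value %in% test_geos, ],
       df_test  = df[ df$geo_value %in% test_geos, ])
}
toy <- data.frame(geo_value = rep(letters[1:10], each = 5), y = rnorm(50))
sp <- split_by_geo(toy)
length(intersect(sp$df_train$geo_value, sp$df_test$geo_value))  # 0: no overlap
```

Splitting by geography rather than by row avoids leaking a location's time series across the train/test boundary.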
**Models** L1- and L2-penalized logistic regression, SVM, and xgboost.
Cross-validation for glmnet is done in a block-wise fashion in time; five consecutive time blocks are used as test sets (folds).
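One way to form such time-blocked folds is to cut the date range into five consecutive blocks and pass the resulting fold ids to `cv.glmnet()` via its `foldid` argument. This is a hedged sketch; `time_block_foldid` is a hypothetical helper, not the report's own.

```r
## Assign each row a fold id by cutting its date into nfold consecutive blocks.
time_block_foldid <- function(time_value, nfold = 5) {
  as.integer(cut(as.numeric(time_value), breaks = nfold, labels = FALSE))
}
dates <- seq(as.Date("2020-05-01"), as.Date("2020-08-25"), by = "day")
foldid <- time_block_foldid(dates)
table(foldid)   # five contiguous time blocks
## cv.glmnet(x, y, family = "binomial", foldid = foldid)   # usage
```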
First, visualizing the hotspots for the state level data:
The distribution of 0's and 1's in the entire state level data is
| Response | Count |
|---|---|
| 0 | 2705 |
| 1 | 457 |
training set:
| Response | Count |
|---|---|
| 0 | 1906 |
| 1 | 326 |
test set:
| Response | Count |
|---|---|
| 0 | 799 |
| 1 | 131 |
Using n_ahead=28, the performances of four different classifiers can be visualized as ROC curves. Here, thick transparent lines are the results using Facebook data, and thin lines are those excluding Facebook data.
We can also see population-weighted ROC curves:
Given the class imbalance, we can look at precision-recall plots:
And now the same precision-recall plot, but with a log-scaled x axis:
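For reference, precision and recall at each score threshold can be computed from scratch in a few lines of base R (a sanity-check sketch on toy data, not the plotting code used for the figures):

```r
## Precision and recall at every threshold, by sorting scores descending.
pr_curve <- function(score, label) {
  ord <- order(score, decreasing = TRUE)
  label <- label[ord]
  tp <- cumsum(label)                 # true positives among top-k predictions
  data.frame(precision = tp / seq_along(label),
             recall    = tp / sum(label))
}
set.seed(1)
score <- runif(200)
label <- rbinom(200, 1, plogis(4 * (score - 0.5)))  # labels correlated with scores
pr <- pr_curve(score, label)
```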
Now, choosing a summary statistic to measure the performance advantage from using Facebook data, we calculate the difference in the area under the curve (AUC) between the two versions, for each classifier, as a function of how far ahead hotspots are defined, i.e. for a range of n_ahead in \(\{10, \cdots, 30\}\):
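The AUC itself can be computed without plotting via the rank (Mann-Whitney) formula; the sketch below, on toy scores, shows how an AUC difference of the kind summarized here could be formed. The two score vectors are purely illustrative.

```r
## AUC via the Mann-Whitney rank statistic.
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
set.seed(1)
label <- rbinom(300, 1, 0.15)
score_with_fb    <- label + rnorm(300)           # toy: more informative scores
score_without_fb <- label + rnorm(300, sd = 2)   # toy: noisier scores
auc(score_with_fb, label) - auc(score_without_fb, label)
```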
Interestingly, the advantage of using Facebook data increases as we try to predict a hotspot in a date further in the future.
The analysis depends a bit on how the training/test split is made among geographical regions. If we use a different random seed for splitting, the ROC curves can differ quite a bit:
While the classifier performance varies by seed, the advantage from including Facebook data is consistently present.
The response counts in the county-level data are:
training set:
| Response | Count |
|---|---|
| 0 | 30611 |
| 1 | 6063 |
test set:
| Response | Count |
|---|---|
| 0 | 12826 |
| 1 | 2593 |
Using n_ahead=28, the performances of four different classifiers can be visualized as ROC curves. Here, thick transparent lines are the results using Facebook data, and thin lines are those excluding Facebook data.
And population weighted:
The precision-recall plot:
And the precision-recall plot with a log-scaled x axis:
Now, choosing a summary statistic to measure the performance advantage from using Facebook data, we calculate the difference in the area under the curve (AUC) between the two versions, for each classifier, as a function of how far ahead hotspots are defined, i.e. for a range of n_ahead in \(\{10, \cdots, 30\}\):
Some variables and functions:
- `lags` is the number of lagged values taken of the JHU cases and FB signals. The current default is `lags = 28`.
- `slope = TRUE` adds slopes to the feature matrix composed of the lagged values.
- `n_ahead` is the number of days we want to look ahead, in forming the response data (0 or 1) for hotspot detection. The current default choice is `n_ahead = 28`. The response compares the average \(\bar y_2\) of the 1-week period ending `n_ahead` days from now to the average \(\bar y_1\) of the past 1-week period (again, from now); a hotspot is declared when \(\bar y_2\) exceeds \(\bar y_1\) by more than the `threshold` percent (25%).
- `covidcast_signals()` from the R package covidcast allows the data collection of multiple sources.
- `ready_to_model()` further cleans + readies the data into a \(y|X\) type matrix, to feed into functions like `glmnet()`.
- `onset = FALSE`. If `onset = TRUE`, a hotspot (from the definition of our response variable) will only be considered a hotspot if it has not been a hotspot recently.

First, we download and form the data.
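The response construction just described can be sketched as follows. This is a simplified, hypothetical version: the report's actual function is `response_diff_avg_1week_min20` (whose `min20` suffix suggests an additional minimum-activity condition not reproduced here), and the name `hotspot_response` is invented for illustration.

```r
## 1 if the average incidence of the week ending n_ahead days from now
## exceeds the past week's average by more than `threshold`, else 0.
hotspot_response <- function(x, n_ahead = 28, threshold = 0.25) {
  n <- length(x)
  sapply(seq_len(n), function(t) {
    if (t < 7 || t + n_ahead > n) return(NA)
    ybar_past   <- mean(x[(t - 6):t])                        # past 1 week
    ybar_future <- mean(x[(t + n_ahead - 6):(t + n_ahead)])  # week ending n_ahead days out
    as.integer(ybar_future > ybar_past * (1 + threshold))
  })
}
x <- c(rep(10, 40), rep(30, 40))   # toy incidence that jumps at day 41
resp <- hotspot_response(x, n_ahead = 21)
```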
## Setup
lags = 28
n_ahead = 21 ## 28
threshold = 0.25
geo_type = "state" ## or "county"
response = "confirmed_7dav_incidence_prop"
fn_response = response_diff_avg_1week_min20
fn_response_name = "response_diff_avg_1week_min20"
slope = TRUE
onset = FALSE
split_type = "geo"
## Read in data once
data_sources = c("indicator-combination",
"fb-survey")
signals = c("confirmed_7dav_incidence_prop",
"smoothed_hh_cmnty_cli")
start_day = as.Date("2020-05-01")
end_day = as.Date("2020-08-30")
signals = data.frame(data_sources = data_sources, signals = signals)
suppressMessages({
mat = covidcast_signals(signals,
start_day = start_day, end_day = end_day, geo_type = geo_type)
})
mat <- mat %>% select(geo_value, time_value, signal, data_source, value)
## Form the y|X matrix
source('helpers.r')
t0 <- Sys.time()
t0
## [1] "2020-09-15 15:20:05 PDT"
df_model <- ready_to_model(mat, lags, n_ahead, response, slope, fn_response, threshold, onset)
Sys.time() - t0
## Time difference of 14.40638 secs
df_model %>% names() %>% print()
## [1] "geo_value"
## [2] "time_value"
## [3] "feature_lag0_confirmed_7dav_incidence_prop_indicator-combination"
## [4] "feature_lag0_smoothed_hh_cmnty_cli_fb-survey"
## [5] "feature_lag1_confirmed_7dav_incidence_prop_indicator-combination"
## [6] "feature_lag1_smoothed_hh_cmnty_cli_fb-survey"
## [7] "feature_lag2_confirmed_7dav_incidence_prop_indicator-combination"
## [8] "feature_lag2_smoothed_hh_cmnty_cli_fb-survey"
## [9] "feature_lag3_confirmed_7dav_incidence_prop_indicator-combination"
## [10] "feature_lag3_smoothed_hh_cmnty_cli_fb-survey"
## [11] "feature_lag4_confirmed_7dav_incidence_prop_indicator-combination"
## [12] "feature_lag4_smoothed_hh_cmnty_cli_fb-survey"
## [13] "feature_lag5_confirmed_7dav_incidence_prop_indicator-combination"
## [14] "feature_lag5_smoothed_hh_cmnty_cli_fb-survey"
## [15] "feature_slope1_confirmed_7dav_incidence_prop_indicator-combination"
## [16] "feature_slope1_smoothed_hh_cmnty_cli_fb-survey"
## [17] "feature_slope2_confirmed_7dav_incidence_prop_indicator-combination"
## [18] "feature_slope2_smoothed_hh_cmnty_cli_fb-survey"
## [19] "feature_slope3_confirmed_7dav_incidence_prop_indicator-combination"
## [20] "feature_slope3_smoothed_hh_cmnty_cli_fb-survey"
## [21] "feature_slope4_confirmed_7dav_incidence_prop_indicator-combination"
## [22] "feature_slope4_smoothed_hh_cmnty_cli_fb-survey"
## [23] "feature_slope5_confirmed_7dav_incidence_prop_indicator-combination"
## [24] "feature_slope5_smoothed_hh_cmnty_cli_fb-survey"
## [25] "feature_slope6_confirmed_7dav_incidence_prop_indicator-combination"
## [26] "feature_slope6_smoothed_hh_cmnty_cli_fb-survey"
## [27] "feature_slope7_confirmed_7dav_incidence_prop_indicator-combination"
## [28] "feature_slope7_smoothed_hh_cmnty_cli_fb-survey"
## [29] "feature_slope8_confirmed_7dav_incidence_prop_indicator-combination"
## [30] "feature_slope8_smoothed_hh_cmnty_cli_fb-survey"
## [31] "feature_slope9_confirmed_7dav_incidence_prop_indicator-combination"
## [32] "feature_slope9_smoothed_hh_cmnty_cli_fb-survey"
## [33] "resp"
## save(df_model, mat, s, n_ahead, threshold, geo_type, response, fn_response, fn_response_name, slope, onset, split_type,
## file = file.path(outputdir, "data-county.Rdata"))
## load(file = file.path(outputdir, "data-county.Rdata"))
## Further split into training and test
geo_split_seed = 100
geo_cv_split_seed = 10000
splitted <- sample_split_geo(df_model, pct_test = 0.3, seed = geo_split_seed)
## See the distribution of 1's and 0's
df_model %>% select(resp) %>% table() %>% print()
## .
## 0 1
## 4425 397
splitted$df_train %>% select(resp) %>% table() %>% print()
## .
## 0 1
## 3239 239
splitted$df_test %>% select(resp) %>% table() %>% print()
## .
## 0 1
## 1186 158
foldid <- make_foldid_geo(splitted$df_train, nfold = 5, geo_cv_split_seed)
for(ifold in 1:5){
splitted$df_train[which(foldid==ifold),] %>% select(resp) %>% table() %>% print()
}
## .
## 0 1
## 576 82
## .
## 0 1
## 716 36
## .
## 0 1
## 721 31
## .
## 0 1
## 628 30
## .
## 0 1
## 598 60
Now, fit the hotspot models and visualize the model performances.
make_plots(destin = outputdir,
splitted, lags, n_ahead, geo_type, fn_response_name, threshold, slope, split_type, onset, geo_cv_split_seed)
Also visualize the 1's and 0's in the state level data (highlighting by each CV fold and test data, in different color boxes):
## geos = df_model %>% select(geo_value) %>% unlist() %>% unique()
geo_split_seed = 100
geo_cv_split_seed = 10000
nfold = 5
foldid <- make_foldid_geo(splitted$df_train, nfold = nfold, seed = geo_cv_split_seed)
state_splits = lapply(1:nfold, function(ifold)
splitted$df_train[which(foldid==ifold),] %>% select(geo_value) %>% unlist() %>% unique())
nn1 = 14
nn2 = 4
par(mfrow = c(nn1, nn2)); par(mar = c(3, 3, 1, 1)); par(cex=.7)
## iilist = 1:length(geos)
geos = lapply(1:nfold, function(ifold){
splitted$df_train[which(foldid==ifold),] %>% select(geo_value) %>% unique()}) %>%
unlist()
test_geos = splitted$df_test %>% select(geo_value) %>% unlist() %>% unique()
geos = c(geos, test_geos)
for(geo in geos){
## Form the response data (0's and 1's) for all geos
dat0 = df_model %>% subset(geo_value==geo) %>%
select(time_value,
resp,
incidence = "feature_lag0_confirmed_7dav_incidence_prop_indicator-combination" )
## Combine it with original data for this geo (all time points)
dat = mat %>% as_tibble() %>% filter(geo_value==geo, signal =="confirmed_7dav_incidence_prop") %>%
select(time_value, incidence = value)
dat_combined = dat %>% full_join(dat0, by = c("time_value", "incidence"))
dat_combined = dat_combined %>% mutate(resp=replace(resp, is.na(resp), 0))
## Make the plot
dat_combined %>% with(plot(time_value, incidence, col = resp+1, type='o', pch=16, ylim= c(0,100)))
abline(h = 20, col='grey50', lwd=3)
legend("topleft", legend=toupper(geo), bty='n', cex=3)
## Also add legend for fold id (really bad code)
splitnum = sapply(state_splits, function(mysplit) geo %in% mysplit) %>% which()
if(length(splitnum) == 0){
splitnum = "TEST"
} else {
box(lty=1, col=splitnum + 1, lwd=3)
}
legend("topright", legend = paste0("\nfold ", splitnum),
bty="n", cex=2)
}